Utilities for quickly accessing daily weather data from over 9,000 stations

Introducing databrew’s gsod package

Joe Brew | December 8, 2017

Introduction

Weather data matters to many: Weather data is useful across many disciplines: epidemiologists use it to predict the incidence of diseases associated with rainfall (malaria) or humidity (influenza); economists monitor for its effect on things like crop yield, commodity prices, or even violence. It’s a topic of daily elevator conversation and human interest.

Weather data is inaccessible to most: But to most analysts, even those with a great deal of experience, accessing consistent, high quality weather data is difficult. There are plenty of services with current conditions and forecasts, but not historical information. When data is available, it’s often unclear which combination of sources was used, the extent to which values are estimated versus observed, nor how the data were retrieved or processed prior to exposition. In fact, the market for weather data is so opaque that an entire industry has sprung up around packaging government weather data into easy-to-use formats (APIs and bulk downloads).

The gsod package

That’s where the gsod package comes in. High quality, de-aggregated weather data exists, and is usually both collected and maintained by governments. Taxpayers shouldn’t have to pay twice for the privelege of accessing it. The US National Oceanic and Atmospheric Administration’s Global Surface Summary of the Day data, for example, go back for nearly two dozen years, and contain information such as temperature, wind, humidity, rainfall, etc. from more than 9,000 weather stations. DataBrew’s new gsod package makes using those data easy, even for beginners.

Let’s explore the weather.

To get started, head over to the gsod package page and follow instructions for package download / installation. Got it? Good. Now let’s settle the debate once and for all: which has better weather, the Dutch capital of Amsterdam or the Catalan capital of Barcelona.

# Get weather for 2012-2016
                        library(gsod)
                        weather <- bind_rows(gsod2012,
                                             gsod2013,
                                             gsod2014,
                                             gsod2015,
                                             gsod2016)

Now we’ll find the nearest weather stations to our locations of interest through the gsod package’s find_nearest_station function:

# Get the station ids
                        ids <- bind_rows(find_nearest_station('Amsterdam, Netherlands'),
                                         find_nearest_station('Barcelona, Catalonia'))
                        ids
stnid lon lat km stn_name
062400-99999 4.764 52.309 11.65975 SCHIPHOL
081810-99999 2.078 41.297 13.80614 BARCELONA

Next, we’ll plot our weather stations on a world map:

library(rworldmap)
                        library(ggplot2)
                        library(ggrepel)
                        library(databrew)
                        world <- map_data(map="world")
                        
                        g <- 
                          ggplot() + 
                          theme_databrew() +
                          geom_map(data=world, 
                                   map=world,
                                   aes(map_id=region, x=long, y=lat),
                                   fill = 'lightblue',
                                   alpha = 0.8) +
                          geom_point(data = ids,
                                     aes(x = lon,
                                         y = lat),
                                     alpha = 0.6,
                                     size = 3,
                                     color = 'red') +
                          labs(x = 'Longitude',
                               y = 'Latitude') +
                          theme(legend.position = 'bottom') +
                          geom_label_repel(data = ids,
                                     aes(x = lon,
                                         y = lat,
                                         label = stn_name),
                                     alpha = 0.7,
                                     size = 2)

To start, we’ll have a look at the daily high and low temperatures.

# Get weather for only our two stations
                        weather <- weather %>%
                          filter(stnid %in% ids$stnid)
                        # Make long tidy data for plotting temperatures
                        long <- weather %>%
                          dplyr::select(date, max, min, stn_name) %>%
                          gather(key, value, max:min)
                        # Compare temperature
                        g <- ggplot(data = long,
                               aes(x = date,
                                   y = value,
                                   color = key)) +
                          geom_line(alpha = 0.6) +
                          facet_wrap(~stn_name) +
                          theme(legend.position = 'bottom') +
                          scale_color_manual(name = '',
                                             values = c('blue', 'red')) +
                          theme_databrew() +
                          labs(x = 'Date',
                               y = 'Degrees (celcius)')

For many, rainfall matters more than temperature. Let’s have a look at the percentage of days per month which have any precipitation:

rain <- weather %>%
                          mutate(month = format(date, '%m')) %>%
                          group_by(month, stn_name) %>%
                          summarise(rainy = length(which(prcp > 0)),
                                    days = n()) %>%
                          ungroup %>%
                          mutate(p = rainy / days * 100)
                        
                        ggplot(data = rain,
                               aes(x = month,
                                   y = p,
                                   color = stn_name,
                                   group = stn_name)) +
                          geom_line() +
                          theme_databrew() +
                          scale_color_manual(name = '',
                                             values = c('blue', 'red')) +
                          labs(x = 'Month',
                               y = 'Percentage',
                               title = 'Percentage of days which are rainy') +
                          ylim(0,100)

Conclusion

Amsterdam is great if you like the cold and the rain.

Back to Blogs →